
    Online Diversity Control in Symbolic Regression via a Fast Hash-based Tree Similarity Measure

    Diversity is an important aspect of genetic programming and is directly correlated with search performance. When considered at the genotype level, diversity often requires expensive tree distance measures, which have a negative impact on the algorithm's runtime performance. In this work we introduce a fast, hash-based tree distance measure to massively speed up the calculation of population diversity during the algorithmic run. We combine this measure with the standard GA and the NSGA-II genetic algorithms to steer the search towards higher diversity. We validate the approach on a collection of benchmark problems for symbolic regression, where our method consistently outperforms the standard GA as well as NSGA-II configurations with different secondary objectives.
    Comment: 8 pages, conference, submitted to congress on evolutionary computation
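The abstract does not spell out the measure, so the following is only a minimal sketch of one way a hash-based tree similarity could work, not the authors' implementation. It assumes expression trees are nested tuples, hashes every subtree bottom-up (sorting child hashes under commutative operators), and compares two trees with a Sørensen-Dice coefficient over the resulting hash multisets. The names `subtree_hashes`, `similarity`, and `COMMUTATIVE` are hypothetical.

```python
import hashlib
from collections import Counter

# Assumption: a tree is a nested tuple (label, child, child, ...); a leaf is (label,).
COMMUTATIVE = {"+", "*"}  # operators whose argument order should not matter


def subtree_hashes(tree, out):
    """Append the hash of every subtree of `tree` to `out` and return the root hash."""
    label, children = tree[0], tree[1:]
    child_hashes = [subtree_hashes(child, out) for child in children]
    if label in COMMUTATIVE:
        child_hashes.sort()  # x + y and y + x get the same hash
    digest = hashlib.sha1("|".join([label] + child_hashes).encode()).hexdigest()
    out.append(digest)
    return digest


def similarity(a, b):
    """Sørensen-Dice coefficient over the multisets of subtree hashes (1.0 = identical)."""
    hashes_a, hashes_b = [], []
    subtree_hashes(a, hashes_a)
    subtree_hashes(b, hashes_b)
    shared = sum((Counter(hashes_a) & Counter(hashes_b)).values())
    return 2.0 * shared / (len(hashes_a) + len(hashes_b))


t1 = ("+", ("x",), ("*", ("y",), ("z",)))
t2 = ("+", ("*", ("z",), ("y",)), ("x",))  # same expression, arguments reordered
t3 = ("+", ("x",), ("y",))
print(similarity(t1, t2), similarity(t1, t3))
```

A distance can be taken as `1 - similarity(a, b)`. Because each tree is hashed once, a population-wide diversity value only needs comparisons of precomputed hash lists rather than repeated tree traversals.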

    Symbolic Regression with Fast Function Extraction and Nonlinear Least Squares Optimization

    Fast Function Extraction (FFX) is a deterministic algorithm for solving symbolic regression problems. We improve the accuracy of FFX by adding parameters to the arguments of nonlinear functions. Instead of only optimizing linear parameters, we optimize these additional nonlinear parameters with separable nonlinear least squares optimization using a variable projection algorithm. Both FFX and our new algorithm are applied to the PennML benchmark suite. We show that the proposed extensions of FFX lead to higher accuracy while providing models of similar length and with only a small increase in runtime on the given data. Our results are compared to a large set of regression methods that were already published for the given benchmark suite.
    Comment: Submitted manuscript to be published in Computer Aided Systems Theory - EUROCAST 2022: 18th International Conference, Las Palmas de Gran Canaria, Feb. 202
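To illustrate the idea of separable nonlinear least squares, here is a small sketch under stated assumptions: the model is y = a * exp(b * x), with one linear parameter a and one nonlinear parameter b inside the function argument. For any fixed b the best a has a closed form, so a can be eliminated ("projected out") and only b is searched. The one-dimensional search below is a plain golden-section search chosen for brevity; it is a stand-in, not the variable projection solver used in the paper, and all function names are hypothetical.

```python
import math


def best_linear_coeff(xs, ys, b):
    """For fixed b, return the least-squares optimal a in y = a * exp(b * x)."""
    phi = [math.exp(b * x) for x in xs]
    a = sum(y * p for y, p in zip(ys, phi)) / sum(p * p for p in phi)
    return a, phi


def projected_sse(xs, ys, b):
    """Sum of squared errors after eliminating the linear parameter a."""
    a, phi = best_linear_coeff(xs, ys, b)
    return sum((y - a * p) ** 2 for y, p in zip(ys, phi))


def fit(xs, ys, lo=-5.0, hi=5.0, tol=1e-10):
    """Golden-section search over b only; assumes one minimum inside [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if projected_sse(xs, ys, c) < projected_sse(xs, ys, d):
            hi = d
        else:
            lo = c
    b = (lo + hi) / 2.0
    a, _ = best_linear_coeff(xs, ys, b)
    return a, b


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(-0.7 * x) for x in xs]  # noiseless synthetic data
a, b = fit(xs, ys)
print(a, b)
```

The design point is that the search space shrinks from (a, b) to b alone, which is why the approach adds only modest runtime when a model has many linear but few nonlinear parameters.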

    Symbolic Regression in Materials Science: Discovering Interatomic Potentials from Data

    Particle-based modeling of materials at the atomic scale plays an important role in the development of new materials and the understanding of their properties. The accuracy of particle simulations is determined by interatomic potentials, which allow the potential energy of an atomic system to be calculated as a function of atomic coordinates and potentially other properties. First-principles-based ab initio potentials can reach arbitrary levels of accuracy; however, their applicability is limited by their high computational cost. Machine learning (ML) has recently emerged as an effective way to offset the high computational costs of ab initio atomic potentials by replacing expensive models with highly efficient surrogates trained on electronic structure data. Among a plethora of current methods, symbolic regression (SR) is gaining traction as a powerful "white-box" approach for discovering functional forms of interatomic potentials. This contribution discusses the role of symbolic regression in Materials Science (MS) and offers a comprehensive overview of current methodological challenges and state-of-the-art results. A genetic programming-based approach for modeling atomic potentials from raw data (consisting of snapshots of atomic positions and associated potential energy) is presented and empirically validated on ab initio electronic structure data.
    Comment: Submitted to the GPTP XIX Workshop, June 2-4 2022, University of Michigan, Ann Arbor, Michigan

    Evolutionary Algorithms for Segment Optimization in Vectorial GP [Poster]

    Vectorial Genetic Programming (Vec-GP) extends regular GP by allowing vectorial input features (e.g. time series data), while retaining the expressiveness and interpretability of regular GP. The availability of raw vectorial data during training not only enables Vec-GP to select appropriate aggregation functions itself, but also allows Vec-GP to extract segments from vectors prior to aggregation (like windows for time series data). This is a critical factor in many machine learning applications, as vectors can be very long and only small segments may be relevant. However, allowing aggregation over segments within GP models makes the training more complicated. We explore the use of common evolutionary algorithms to help GP identify appropriate segments, which we analyze using a simplified problem that focuses on optimizing aggregation segments on fixed data. Since the studied algorithms are to be used in GP for local optimization (e.g. as a mutation operator), we evaluate not only the quality of the solutions, but also the convergence speed and anytime performance. Among the evaluated algorithms, CMA-ES, PSO and ALPS show the most promising results and are prime candidates for evaluation within GP.
    Funding: 875441 Vektor-basierte Genetische Programmierung für Symbolische Regression und Klassifikation mit Zeitreihen (SymRegZeit), funded by the Austrian Research Promotion Agency FFG. It was also partially supported by FCT, Portugal, through funding of research units MagIC/NOVA IMS (UIDB/04152/2020) and LASIGE (UIDB/00408/2020 and UIDP/00408/2020).
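The simplified problem described above, choosing an aggregation segment on fixed data, can be made concrete with a small sketch. It assumes the aggregation is a mean over `vector[start:end]` and that the objective is the squared error against known target values. The search here is a plain exhaustive scan over all segments, used only as a deterministic baseline to show the objective; it is not one of the evolutionary algorithms evaluated in the poster, and `segment_mean` and `best_segment` are hypothetical names.

```python
from statistics import mean


def segment_mean(vector, start, end):
    """Aggregate a vector by averaging the half-open segment [start, end)."""
    return mean(vector[start:end])


def best_segment(vectors, targets, min_len=1):
    """Exhaustively find the (error, start, end) with the lowest squared error."""
    n = len(vectors[0])
    best = None
    for start in range(n):
        for end in range(start + min_len, n + 1):
            err = sum(
                (segment_mean(v, start, end) - t) ** 2
                for v, t in zip(vectors, targets)
            )
            if best is None or err < best[0]:
                best = (err, start, end)
    return best


# Toy data: each target is the mean of positions 2 and 3 of its vector.
vectors = [[0, 0, 1, 3, 0, 0], [5, 5, 2, 4, 5, 5], [1, 9, 6, 6, 9, 1]]
targets = [2, 3, 6]
result = best_segment(vectors, targets)
print(result)
```

The scan evaluates on the order of n² segments per call, which is the cost that motivates heuristic optimizers when vectors are long and the search runs inside a GP mutation step.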

    Improved homology-driven computational validation of protein-protein interactions motivated by the evolutionary gene duplication and divergence hypothesis

    Background: Protein-protein interaction (PPI) data sets generated by high-throughput experiments are contaminated by large numbers of erroneous PPIs. Therefore, computational methods for PPI validation are necessary to improve the quality of such data sets. Against the background of the theory that most extant PPIs arose as a consequence of gene duplication, the sensitive search for homologous PPIs, i.e. for PPIs descending from a common ancestral PPI, should be a successful strategy for PPI validation.
    Results: To validate an experimentally observed PPI, we combine FASTA and PSI-BLAST to perform a sensitive sequence-based search for pairs of interacting homologous proteins within a large, integrated PPI database. A novel scoring scheme that incorporates both quality and quantity of all observed matches allows us (1) to consider also tentative paralogs and orthologs in this analysis and (2) to combine search results from more than one homology detection method. ROC curves illustrate the high efficacy of this approach and its improvement over other homology-based validation methods.
    Conclusion: New PPIs are primarily derived from preexisting PPIs and not invented de novo. Thus, the hallmark of true PPIs is the existence of homologous PPIs. The sensitive search for homologous PPIs within a large body of known PPIs is an efficient strategy to separate biologically relevant PPIs from the many spurious PPIs reported by high-throughput experiments.

    Local Optimization and Complexity Control for Symbolic Regression

    Symbolic regression is a data-based machine learning approach that creates interpretable prediction models in the form of mathematical expressions without the necessity to specify the model structure in advance. Due to the large number of possible models, symbolic regression problems are commonly solved by metaheuristics such as genetic programming. A drawback of this method is that, because the model structure and model parameters are optimized simultaneously, the effort for learning from the presented data is increased and the obtained prediction accuracy can suffer. Furthermore, genetic programming in general has to deal with bloat, an increase in model length and complexity without an accompanying increase in prediction accuracy, which hampers the interpretability of the models. The goal of this thesis is to develop and present new methods for symbolic regression that improve the prediction accuracy, interpretability, and simplicity of the models. The prediction accuracy is improved by integrating local optimization techniques that adapt the numerical model parameters in the algorithm. Thus, the symbolic regression problem is divided into two separate subproblems: finding the most appropriate structure describing the data and finding optimal parameters for the specified model structure. Genetic programming excels at finding appropriate model structures, whereas the Levenberg-Marquardt algorithm performs least-squares curve fitting and model parameter tuning. The combination of these two methods significantly improves the prediction accuracy of generated models. Another improvement is to turn the standard single-objective formulation of symbolic regression into a multi-objective one, where the prediction accuracy is maximized while the model complexity is simultaneously minimized. As a result, the algorithm does not produce a single solution but a Pareto front of models with varying accuracy and complexity. In addition, a novel complexity measure for multi-objective symbolic regression is developed that includes syntactic and semantic information about the models while still being efficiently computed. By using this new complexity measure, the generated models get simpler and the occurrence of bloat is reduced.
    German abstract (translated): Symbolic regression is a data-based machine learning method in which prediction models are created in the form of mathematical expressions without a predefined model structure. Because of the large number of possible models that describe the data, symbolic regression problems are usually solved by means of genetic programming. A drawback is that, because the model structure and its parameters are optimized simultaneously, the effort for learning the models is increased and their accuracy can be reduced. In addition, the interpretability of the models is hampered by the occurrence of superfluous expressions (bloat), which complicate the models without increasing their accuracy. The goal of this dissertation is to develop new methods for improving the accuracy and interpretability of symbolic regression models. The accuracy of the models is increased by integrating local optimization, which adapts the numerical parameters of the models. The regression problem is thereby divided into two tasks: first, a suitable model structure is identified, and then its numerical parameters are adapted. Genetic programming is used to identify the model structure, while the Levenberg-Marquardt algorithm performs a nonlinear fit of the numerical parameters. Experiments show that the combination of these methods results in a clear improvement in model accuracy. The interpretability of the models is improved by changing the problem formulation from single-objective to multi-objective optimization, whereby the accuracy of the models is maximized while their complexity is simultaneously minimized. The result is therefore no longer a single model but a Pareto front that reflects the trade-off between accuracy and complexity. In addition, a new complexity measure for symbolic regression is presented that takes syntactic and semantic information into account. Using this new complexity measure makes the generated models more interpretable and avoids superfluous expressions.
    Submitted by Michael Kommenda. Universität Linz, Dissertation, 2018. OeBB(VLID)258190
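The multi-objective formulation in the thesis abstract above returns a Pareto front rather than a single model. As a minimal sketch of what that means, the snippet below filters a list of candidate models down to the non-dominated ones. It assumes each model is a `(name, error, complexity)` tuple where lower is better for both values (prediction error stands in for accuracy); the example models and numbers are made up for illustration and do not come from the thesis.

```python
def dominates(p, q):
    """True if p is at least as good as q in both objectives and better in one."""
    return p[0] <= q[0] and p[1] <= q[1] and (p[0] < q[0] or p[1] < q[1])


def pareto_front(models):
    """Keep the models that no other model dominates, sorted by complexity."""
    front = [
        m for m in models
        if not any(dominates(other[1:], m[1:]) for other in models)
    ]
    return sorted(front, key=lambda m: m[2])


# Hypothetical candidates: (name, prediction error, model complexity)
models = [
    ("a", 0.30, 3),
    ("b", 0.10, 7),
    ("c", 0.12, 15),  # dominated by "b": higher error and more complex
    ("d", 0.05, 20),
    ("e", 0.30, 5),   # dominated by "a": same error but more complex
]
front = pareto_front(models)
print([name for name, _, _ in front])
```

Each model on the front is a different compromise, so the final choice between a short, less accurate expression and a longer, more accurate one is left to the user.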

    Extended Regression Models for Predicting the Pumping Capability and Viscous Dissipation of Two-Dimensional Flows in Single-Screw Extrusion

    Generally, numerical methods are required to model the non-Newtonian flow of polymer melts in single-screw extruders. Existing approximation equations for modeling the throughput-pressure relationship and viscous dissipation are limited in their scope of application, particularly when it comes to special screw designs. Maximum dimensionless throughputs of Π_V < 2.0, implying minimum dimensionless pressure gradients Π_{p,z} ≥ −0.5 for low power-law exponents, are captured. We present analytical approximation models for predicting the pumping capability and viscous dissipation of metering channels for an extended range of influencing parameters (Π_{p,z} ≥ −1.0 and t/D_b ≤ 2.4) required to model wave- and energy-transfer screws. We first rewrote the governing equations in dimensionless form, identifying three independent influencing parameters: (i) the dimensionless down-channel pressure gradient Π_{p,z}, (ii) the power-law exponent n, and (iii) the screw-pitch ratio t/D_b. We then carried out a parametric design study covering an extended range of the dimensionless influencing parameters. Based on this data set, we developed regression models for predicting the dimensionless throughput-pressure relationship and the viscous dissipation. Finally, the accuracy of all three models was proven using an independent data set for evaluation. We demonstrate that our approach provides excellent approximation. Our models allow fast, stable, and accurate prediction of both throughput-pressure behavior and viscous dissipation.